2 research outputs found
SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects
Despite the progress we have recorded in the last few years in multilingual
natural language processing, evaluation is typically limited to a small set of
languages with available datasets, which excludes a large number of low-resource
languages. In this paper, we created SIB-200 -- a large-scale open-sourced
benchmark dataset for topic classification in 200 languages and dialects to
address the lack of evaluation datasets for Natural Language Understanding
(NLU). For many of the languages covered in SIB-200, this is the first publicly
available evaluation dataset for NLU. The dataset is based on the Flores-200
machine translation corpus. We annotated the English portion of the dataset and
extended the sentence-level annotation to the remaining 203 languages covered
in the corpus. Despite the simplicity of this task, our evaluation in the
fully supervised, cross-lingual transfer, and large language model prompting
settings shows that there is still a large gap between the
performance of high-resource and low-resource languages when multilingual
evaluation is scaled to numerous world languages. We found that languages
unseen during the pre-training of multilingual language models,
under-represented language families (like Nilotic and Atlantic-Congo), and
languages from the regions of Africa, the Americas, Oceania and South East Asia,
often have the lowest performance on our topic classification dataset. We hope
our dataset will encourage a more inclusive evaluation of multilingual language
models on a more diverse set of languages. https://github.com/dadelani/sib-200
Comment: under submission
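The annotation step described in the abstract (label the English sentences, then extend each label to the same sentence in every other language) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the parallel corpus is aligned by sentence index, and the function name `propagate_labels`, the language codes, the topic labels, and the placeholder sentences are all hypothetical.

```python
from typing import Dict, List, Tuple


def propagate_labels(
    english_labels: List[str],
    parallel_sentences: Dict[str, List[str]],
) -> Dict[str, List[Tuple[str, str]]]:
    """Copy sentence-level labels from English to every aligned language.

    Assumption: sentence i in each language is a translation of English
    sentence i, so it can inherit the label of English sentence i.
    """
    labelled: Dict[str, List[Tuple[str, str]]] = {}
    for lang, sentences in parallel_sentences.items():
        if len(sentences) != len(english_labels):
            raise ValueError(
                f"{lang}: {len(sentences)} sentences but "
                f"{len(english_labels)} labels; corpus is not aligned"
            )
        # Pair each translated sentence with the label of its English source.
        labelled[lang] = list(zip(sentences, english_labels))
    return labelled


# Hypothetical toy input: two labelled English sentences and placeholder
# translations in two made-up language codes.
labels = ["sports", "health"]
corpus = {
    "lang_a": ["<lang_a sentence 0>", "<lang_a sentence 1>"],
    "lang_b": ["<lang_b sentence 0>", "<lang_b sentence 1>"],
}
dataset = propagate_labels(labels, corpus)
```

Because the labels are attached by index only, any misalignment in the parallel corpus would silently mislabel sentences, which is why the sketch checks the lengths before pairing.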
Coronal Heating as Determined by the Solar Flare Frequency Distribution Obtained by Aggregating Case Studies
Flare frequency distributions represent a key approach to addressing one of
the largest problems in solar and stellar physics: determining the mechanism
that counter-intuitively heats coronae to temperatures that are orders of
magnitude hotter than the corresponding photospheres. It is widely accepted
that the magnetic field is responsible for the heating, but there are two
competing mechanisms that could explain it: nanoflares or Alfvén waves. To
date, neither can be directly observed. Nanoflares are, by definition,
extremely small, but their aggregate energy release could represent a
substantial heating mechanism, presuming they are sufficiently abundant. One
way to test this presumption is via the flare frequency distribution, which
describes how often flares of various energies occur. If the slope of the power
law fitting the flare frequency distribution is above a critical threshold,
as established in prior literature, then there should be a
sufficient abundance of nanoflares to explain coronal heating. We performed
600 case studies of solar flares, made possible by an unprecedented number
of data analysts via three semesters of an undergraduate physics laboratory
course. This allowed us to include two crucial, but nontrivial, analysis
methods: pre-flare baseline subtraction and computation of the flare energy,
which requires determining flare start and stop times. We aggregated the
results of these analyses into a statistical study to determine the power-law
slope of the flare frequency distribution. This is below the critical
threshold, suggesting that Alfvén
waves are an important driver of coronal heating.
Comment: 1,002 authors, 14 pages, 4 figures, 3 tables, published by The
Astrophysical Journal on 2023-05-09, volume 948, page 7
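The analysis chain named in the abstract (pre-flare baseline subtraction, integration between flare start and stop times, then a power-law fit to the resulting distribution) can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the baseline is assumed to be the mean flux before the flare start, the integrated excess flux stands in for flare energy up to a constant factor, and the slope is estimated with the standard maximum-likelihood estimator for a continuous power law rather than whatever fitting method the paper used. All function names are hypothetical.

```python
import math
import random


def baseline_subtracted_fluence(times, flux, t_start, t_stop):
    """Integrate the flux above the pre-flare baseline from t_start to t_stop.

    Assumption: the baseline is the mean of all samples before t_start.
    The result is proportional to flare energy only up to a conversion
    factor that is not modelled here.
    """
    pre_flare = [f for t, f in zip(times, flux) if t < t_start]
    if not pre_flare:
        raise ValueError("need at least one sample before t_start")
    baseline = sum(pre_flare) / len(pre_flare)
    # Keep only the samples inside the flare window, with baseline removed.
    window = [(t, f - baseline) for t, f in zip(times, flux)
              if t_start <= t <= t_stop]
    # Trapezoidal rule over consecutive samples in the window.
    total = 0.0
    for (t0, f0), (t1, f1) in zip(window, window[1:]):
        total += 0.5 * (f0 + f1) * (t1 - t0)
    return total


def powerlaw_slope_mle(energies, e_min):
    """Estimate alpha for dN/dE proportional to E**(-alpha), E >= e_min.

    Uses the continuous maximum-likelihood estimator
    alpha = 1 + n / sum(ln(E_i / e_min)).
    """
    tail = [e for e in energies if e >= e_min]
    if len(tail) < 2:
        raise ValueError("need at least two energies at or above e_min")
    log_sum = sum(math.log(e / e_min) for e in tail)
    return 1.0 + len(tail) / log_sum


def sample_powerlaw(alpha, e_min, n, seed=0):
    """Draw n synthetic energies from a power law (inverse-transform sampling)."""
    rng = random.Random(seed)
    return [e_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]
```

In use, each case study would contribute one value from `baseline_subtracted_fluence`, and the aggregated values would be passed to `powerlaw_slope_mle`; the estimate is then compared against the critical threshold from the prior literature, which the text above does not state numerically and which is therefore left out of the sketch.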